feat: estimate cycles #17
4b2e2b7 to 22e0934
Greptile Summary
This PR adds per-instruction cycle estimation to Callgrind by integrating Capstone for real-time instruction decoding and generated LUT files (
Confidence Score: 5/5
Safe to merge; both findings are non-blocking edge cases that do not affect the common execution path.
The core decode-and-lookup pipeline, cost accumulation in cachesim_add_icost and setup_bbcc, and the build system wiring are all correct. Both findings are narrow edge cases that do not affect normal amd64/arm64 operation.
callgrind/main.c (zero-length IMark handling) and callgrind/cycledecode_capstone.h (i386 mode guard)
650e97b to 86ac213
GuillaumeLagrange left a comment
lgtm, do we have internal documentation on how we generated the LUT?
Also, out of curiosity: why do we need Capstone? It could be made a bit clearer.
My understanding is that it's used to turn each instruction's operation into the ID used for the LUT lookup?
We need to build a LUT entry for each instruction, but we can't just key on the raw bytes, as some instructions have 64-bit immediate params. So we have to extract the parts that identify an instruction. Intel's XED decoder has IFORM, which would be super helpful here, as it can identify each instruction by its category. For example,
We can't use XED, as it's only for x86_64 and not ARM, which is why we manually reconstruct something similar with Capstone.
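To make the idea concrete, here is a minimal sketch of building an IFORM-like key from a decoded instruction. It is not the code from this PR: `DecodedInsn`, `Operand`, `OpKind`, `make_lut_key`, and the `mov_r64_i64` key format are all hypothetical names chosen for illustration. A Capstone-backed decoder would presumably fill these fields from the mnemonic and the per-operand type and size it reports. The sketch also uses libc's `snprintf`, which a `-nodefaultlibs` tool could not call directly.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical operand kinds; a real decoder would map its own operand
   types onto something like this. */
typedef enum { OP_REG, OP_IMM, OP_MEM } OpKind;

typedef struct {
    OpKind kind;
    unsigned size_bits;   /* operand width in bits */
} Operand;

typedef struct {
    const char *mnemonic;
    Operand ops[4];
    size_t n_ops;
} DecodedInsn;

/* Build a key such as "mov_r64_i64" from the mnemonic plus operand kinds and
   widths. Immediate values and register numbers are deliberately left out,
   so every "mov reg64, imm64" maps to the same table entry.
   Returns the key length, or -1 if the buffer is too small. */
static int make_lut_key(const DecodedInsn *insn, char *out, size_t cap)
{
    static const char kind_tag[] = { 'r', 'i', 'm' };
    int n = snprintf(out, cap, "%s", insn->mnemonic);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    size_t used = (size_t)n;

    for (size_t i = 0; i < insn->n_ops; i++) {
        n = snprintf(out + used, cap - used, "_%c%u",
                     kind_tag[insn->ops[i].kind], insn->ops[i].size_bits);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Two `mov` instructions that differ only in their immediate value produce the same key, which is the property the raw bytes lack.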
Add the regenerated x86_caps_lut.inc / arm64_caps_lut.inc cost tables consumed by the --cycle-estimation runtime.
- x86: Zen4-tuned reciprocal-throughput table.
- arm64: measured Cortex-A72 table, with a hand-frozen guide supplement for ops that are not benchmarked.
…-bit Capstone
The amd64 host builds both the primary (amd64) tool and a 32-bit x86 secondary tool, but Capstone is only built 64-bit, so CLG_WITH_CAPSTONE is set only for the primary build. The secondary build compiled cycledecode.c without it and tripped the mandatory-Capstone #error.
CodSpeed only ever runs the 64-bit tool, so build 64-bit only everywhere: add --enable-only64bit to the CI configure, the release deb (debian/rules, now unconditional), and the Justfile, and drop the now-unneeded gcc-multilib / libc6-dev-i386 deps. This also roughly halves build time by skipping the entire 32-bit toolchain.
Callgrind's cycle estimation links a static Capstone decoder.
Add a build step to the ci, codspeed and release workflows that compiles Capstone 5.0.9 with x86+arm64 only (other printers reference libc symbols the -nodefaultlibs tool does not shim) and without stack-protector/fortify (the tool runs without glibc's %fs TLS), then exports its prefix as CAPSTONE_DIR for configure to pick up.
Add cmake to the apt deps and forward CAPSTONE_DIR through debuild -e in the release build.
… estimation
configure.ac gains --with-capstone=PATH (defaulting to $CAPSTONE_DIR) and makes a static Capstone mandatory for the native tool, compiling the decoder with fortify disabled since it links -nodefaultlibs.
Makefile.am adds the cycledecode sources/headers, ships the LUT .inc tables, and passes the Capstone CFLAGS/LIBS.
debian/rules forwards CAPSTONE_DIR to configure via --with-capstone.
Decode the real guest bytes of each instruction (via Capstone) at first translation and look up reciprocal-throughput (Ct) and latency (Cl) estimates in the cost table.
Register an EG_CYCLES event group exposing Ct/Cl, accumulate self cost in the cache simulator, and keep running inclusive sums per BB so that the call-graph cost at each side exit is an O(1) lookup.
Fall back to a flat 1.00 cycle (with a warning) on decode failure or when the table has no match, and disable the feature if Capstone is unavailable for the guest.
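The lookup-with-fallback and running-sum scheme described above could look roughly like the following. This is a sketch under stated assumptions, not the PR's code. `Cost`, `LutEntry`, `lookup_cost`, `build_running_sums`, and `cost_at_exit` are hypothetical names. Storing costs in hundredths of a cycle, applying the 1.00-cycle fallback to both Ct and Cl, and using a linear table scan are all choices made only to keep the example short.

```c
#include <stddef.h>
#include <string.h>

/* Costs in hundredths of a cycle (an assumption made for this sketch). */
typedef struct {
    unsigned ct;   /* reciprocal throughput */
    unsigned cl;   /* latency */
} Cost;

typedef struct {
    const char *key;   /* hypothetical instruction key */
    Cost cost;
} LutEntry;

/* Look a key up in the table. On a miss, return a flat 1.00 cycle for both
   metrics and set *missed so the caller can emit a warning. */
static Cost lookup_cost(const LutEntry *lut, size_t n, const char *key,
                        int *missed)
{
    const Cost fallback = { 100u, 100u };
    for (size_t i = 0; i < n; i++) {
        if (strcmp(lut[i].key, key) == 0) {
            if (missed) *missed = 0;
            return lut[i].cost;
        }
    }
    if (missed) *missed = 1;
    return fallback;
}

/* sums[i] holds the total cost of instructions 0..i of the basic block. */
static void build_running_sums(const Cost *per_insn, size_t n, Cost *sums)
{
    Cost acc = { 0u, 0u };
    for (size_t i = 0; i < n; i++) {
        acc.ct += per_insn[i].ct;
        acc.cl += per_insn[i].cl;
        sums[i] = acc;
    }
}

/* Cost executed when the block is left after instruction exit_idx:
   a single array read, independent of block length. */
static Cost cost_at_exit(const Cost *sums, size_t exit_idx)
{
    return sums[exit_idx];
}
```

The sums are built once per basic block at translation time, so a side exit taken partway through the block only needs to read one precomputed entry instead of re-adding the per-instruction costs.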
1a36a0a to 2bc9a1c
No description provided.